Approximate lineage for probabilistic databases

Authors

  • Christopher Ré
  • Dan Suciu
Abstract

In probabilistic databases, lineage is fundamental to both query processing and understanding the data. Current systems such as Trio or Mystiq use a complete approach in which the lineage for a tuple t is a Boolean formula that represents all derivations of t. In large databases lineage formulas can become huge: in one public database (the Gene Ontology) we often observed 10MB of lineage (provenance) data for a single tuple. In this paper we propose to use approximate lineage, a much smaller formula that keeps track of only the most important derivations, which the system can use to process queries and provide explanations. We discuss in detail two specific kinds of approximate lineage: (1) a conservative approximation called sufficient lineage that records the most important derivations for each tuple, and (2) polynomial lineage, which is more aggressive and can provide higher compression ratios, and which is based on Fourier approximations of Boolean expressions. In this paper we define approximate lineage formally, describe algorithms to compute approximate lineage, formally prove their error bounds, and validate our approach experimentally on a real data set.
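
To make the idea of sufficient lineage concrete, here is a minimal sketch under stated assumptions; it is not the algorithm from the paper. It assumes the lineage of a tuple is a monotone DNF formula over independent Boolean tuple events, and it uses a simple hypothetical heuristic (keep the k most probable derivations) to pick the smaller formula. The function names, variable names, and probabilities are illustrative only. Because the kept derivations are a subset of the original ones, the approximate formula implies the full one, so its probability can only underestimate the exact value.

```python
from itertools import product


def dnf_probability(terms, prob):
    """Exact probability that a monotone DNF is true.

    Enumerates every possible world, so it is only usable for tiny examples.
    `terms` is a list of sets of variable names (one set per derivation);
    `prob` maps each variable to its independent marginal probability.
    """
    variables = sorted({v for term in terms for v in term})
    total = 0.0
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        # The DNF holds if at least one derivation has all its events present.
        if any(all(world[v] for v in term) for term in terms):
            weight = 1.0
            for v in variables:
                weight *= prob[v] if world[v] else 1.0 - prob[v]
            total += weight
    return total


def term_probability(term, prob):
    """Probability that every event in one derivation is present."""
    p = 1.0
    for v in term:
        p *= prob[v]
    return p


def sufficient_lineage(terms, prob, k):
    """Hypothetical heuristic: keep the k most probable derivations."""
    ranked = sorted(terms, key=lambda t: term_probability(t, prob), reverse=True)
    return ranked[:k]


# Illustrative data: three derivations of one tuple over five base events.
prob = {"x1": 0.9, "x2": 0.8, "x3": 0.1, "x4": 0.2, "x5": 0.05}
lineage = [frozenset({"x1", "x2"}), frozenset({"x3", "x4"}), frozenset({"x5"})]

approx = sufficient_lineage(lineage, prob, k=1)
exact_p = dnf_probability(lineage, prob)
approx_p = dnf_probability(approx, prob)
print(approx, approx_p, exact_p)
```

In this toy instance the single kept derivation accounts for about 0.72 of the exact probability of roughly 0.739, which illustrates why a much smaller formula can still be useful for query processing and for explanations.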

Similar resources

Data Lineage: A Survey

Lineage, or provenance, in its most general form describes where data came from, how it was derived, and how it was updated over time. Information management systems today exploit lineage in tasks ranging from data verification in curated databases [1] to confidence computation in probabilistic databases [10, 12]. Here, we formalize and categorize lineage, discuss a set of selected papers, and ...

10 Years of Probabilistic Querying - What Next?

Over the past decade, the two research areas of probabilistic databases and probabilistic programming have intensively studied the problem of making structured probabilistic inference scalable, but—so far—both areas developed almost independently of one another. While probabilistic databases have focused on describing tractable query classes based on the structure of query plans and data lineag...

Trio: A System for Integrated Management of Data, Accuracy, and Lineage

Trio is a new database system that manages not only data, but also the accuracy and lineage of the data. Inexact (uncertain, probabilistic, fuzzy, approximate, incomplete, and imprecise!) databases have been proposed in the past, and the lineage problem also has been studied. The goals of the Trio project are to combine and distill previous work into a simple and usable model, design a query la...

An Overview on Querying and Learning in Temporal Probabilistic Databases

Probabilistic databases store, query and manage large amounts of uncertain information in an efficient way. This paper summarizes my thesis which advances the state-of-the-art in probabilistic databases in three different ways: First, we present a closed and complete data model for temporal probabilistic databases. Queries are posed via temporal deduction rules which induce lineage formulas cap...

Recording Provenance on Probabilistic Databases

Tracking data provenance (or lineage) has become increasingly important in many large-scale applications. Till now, a few methods have been proposed to record data provenance. However, most of them mainly focus on deterministic databases except Trio style lineage that aims at probabilistic databases. Processing provenance upon probabilistic database is even challenging because of the exponentia...

Journal:
  • PVLDB

Volume 1  Issue 

Pages  -

Publication date 2008